Topic Identification and Analysis in Large News Corpora

Authors

  • Sarjoun Doumit
  • Ali A. Minai
Abstract

The media today bombards us with massive amounts of news about events ranging from the mundane to the memorable. This growing cacophony places an ever greater premium on being able to identify significant stories and to capture their salient features. In this paper, we consider the problem of mining online news over a given period to identify the major stories of that time. Major stories are defined as those that were widely reported, persisted for a significant duration, or had a lasting influence on subsequent stories. Some statistical methods have recently been proposed to extract important information from large corpora, but most of them do not consider the full richness of language or the variations in its use across multiple reporting sources. We propose a method to extract major stories from large news corpora using a combination of Latent Dirichlet Allocation and n-gram analysis.

Similar resources

Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts

This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking project. The research goal of the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segment...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Multiple Annotations of Reusable Data Resources: Corpora for Topic Detection and Tracking

Responding to demands for very large, easily accessible, reusable news corpora to support research in the topic detection and tracking paradigm, the Linguistic Data Consortium created the TDT corpora. In addition to supporting research in the Topic Detection and Tracking program, the TDT corpora were collected and annotated with an eye toward reuse and re-annotation. Their value is confirmed in...

Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora

The Linguistic Data Consortium at the University of Pennsylvania has recently been engaged in the creation of large-scale annotated corpora of broadcast news materials in support of the ongoing Topic Detection and Tracking (TDT) research project. The TDT corpora were designed to support three basic research tasks: segmentation, topic detection, and topic tracking in newswire, television and rad...

Studying the Prominence-based Situation in Iranian Television News from a Persuasion Perspective

This article is part of a much broader study of persuasive techniques in television news in Iran. As a very important part of IRIB news, IRINN was chosen for this study. If we assume that it is one of the most influential television news outlets, its power of persuasion over the audience is very important. It seemed that choosing the highest hypothetical level of news and deconstructing it would ...

Publication date: 2012